{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 搭建一个分词工具\n",
    "\n",
    "### 基于枚举方法来搭建中文分词工具\n",
    "\n",
    "此项目需要的数据，[现代汉语常用词表.txt](https://github.com/liangqi/chinese-frequency-word-list)： \n",
    "1. 包含了中文词，当做词典来用\n",
    "2. 提供了词频，相当于可以变相得到unigram概率 word_prob\n",
    "\n",
    "\n",
    "举个例子： 给定词典=[我们 学习 人工 智能 人工智能 未来 是]， 另外我们给定unigram概率：p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15\n",
    "\n",
    "#### Step 1: 对于给定字符串：”我们学习人工智能，人工智能是未来“, 找出所有可能的分割方式\n",
    "- [我们，学习，人工智能，人工智能，是，未来]\n",
    "- [我们，学习，人工，智能，人工智能，是，未来]\n",
    "- [我们，学习，人工，智能，人工，智能，是，未来]\n",
    "- [我们，学习，人工智能，人工，智能，是，未来]\n",
    ".......\n",
    "\n",
    "\n",
    "#### Step 2: 我们也可以计算出每一个切分之后句子的概率\n",
    "- p(我们，学习，人工智能，人工智能，是，未来)= -log p(我们)-log p(学习)-log p(人工智能)-log p(人工智能)-log p(是)-log p(未来)\n",
    "- p(我们，学习，人工，智能，人工智能，是，未来)=-log p(我们)-log p(学习)-log p(人工)-log p(智能)-log p(人工智能)-log p(是)-log p(未来)\n",
    "- p(我们，学习，人工，智能，人工，智能，是，未来)=-log p(我们)-log p(学习)-log p(人工)-log p(智能)-log p(人工)-log p(智能)-log p(是)-log p(未来)\n",
    "- p(我们，学习，人工智能，人工，智能，是，未来)=-log p(我们)-log p(学习)-log p(人工智能)-log p(人工)-log p(智能)-log(是)-log p(未来)\n",
    ".....\n",
    "\n",
    "#### Step 3: 返回第二步中概率最大的结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: 第一步： 现代汉语常用词表.txt中读取所有中文词及对应频率。\n",
    "from collections import Counter\n",
    "import random\n",
    "import numpy as np\n",
    "\n",
    "with open('./data/现代汉语常用词表.txt') as f:\n",
    "    lines = f.readlines()\n",
    "\n",
    "sum = 0\n",
    "word_prob = Counter()\n",
    "\n",
    "for line in lines:\n",
    "    columns = line.strip().split()\n",
    "    # 重复词频率直接相加，（相同词多次出现是因为发音不同，即语义也不同，这里不做区分）\n",
    "    word_prob[columns[0]] += int(columns[-1])\n",
    "    sum += int(columns[-1])\n",
    "\n",
    "# 频率转为概率\n",
    "for word in word_prob:\n",
    "    word_prob[word] /= sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'高个儿': 1.8414889771925325e-05}, {'传送带': 2.20334194331537e-05}, {'调遣': 1.4040430031872929e-05}]\n",
      "词典大小：55735\n",
      "1.0\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print([{word: word_prob[word]} for word in random.sample(word_prob.keys(), 3)])\n",
    "print(\"词典大小：%d\" % len(word_prob))\n",
    "print(np.sum(list(word_prob.values())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "## TODO 根据词典获得一句话的全切分，暂不考虑标点等\n",
    "def sentence_break(str):\n",
    "    \"\"\"\n",
    "    求该句话在当前词典下的全切分。\n",
    "    思路：状态转移，设M[i]是从句子开始到第i个字所组成句的全切分，word是以字i结尾的可在词典中找到的词，则M[i] = M[i-len(word)] + word\n",
    "    \n",
    "    str: 字符串，传入的句子\n",
    "    \"\"\"\n",
    "    \n",
    "    # 存储状态\n",
    "    memory = [[] for _ in range(len(str))]\n",
    "    \n",
    "    for i in range(0, len(str)):\n",
    "        for j in range(0, i+1):\n",
    "            # 从开始到当前cursor视为一个词\n",
    "            if j == 0:\n",
    "                if str[j:i+1] in word_prob:\n",
    "                    memory[i].append([str[j:i+1]])\n",
    "                continue\n",
    "            # 确定依赖的之前状态存在且（达成转移条件：词存在）\n",
    "            if memory[j-1] and str[j:i+1] in word_prob:\n",
    "                # 状态转移过程\n",
    "                for state in memory[j-1]:\n",
    "                    memory[i].append(state + [str[j:i+1]])\n",
    "\n",
    "    return memory[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['北京', '的', '天气', '真', '好', '啊'], ['北', '京', '的', '天气', '真', '好', '啊'], ['北京', '的', '天', '气', '真', '好', '啊'], ['北', '京', '的', '天', '气', '真', '好', '啊']]\n",
      "[['今天', '的', '课程', '内容', '很', '有意思'], ['今', '天', '的', '课程', '内容', '很', '有意思'], ['今天', '的', '课', '程', '内容', '很', '有意思'], ['今', '天', '的', '课', '程', '内容', '很', '有意思'], ['今天', '的', '课程', '内', '容', '很', '有意思'], ['今', '天', '的', '课程', '内', '容', '很', '有意思'], ['今天', '的', '课', '程', '内', '容', '很', '有意思'], ['今', '天', '的', '课', '程', '内', '容', '很', '有意思'], ['今天', '的', '课程', '内容', '很', '有', '意思'], ['今', '天', '的', '课程', '内容', '很', '有', '意思'], ['今天', '的', '课', '程', '内容', '很', '有', '意思'], ['今', '天', '的', '课', '程', '内容', '很', '有', '意思'], ['今天', '的', '课程', '内', '容', '很', '有', '意思'], ['今', '天', '的', '课程', '内', '容', '很', '有', '意思'], ['今天', '的', '课', '程', '内', '容', '很', '有', '意思'], ['今', '天', '的', '课', '程', '内', '容', '很', '有', '意思'], ['今天', '的', '课程', '内容', '很', '有意', '思'], ['今', '天', '的', '课程', '内容', '很', '有意', '思'], ['今天', '的', '课', '程', '内容', '很', '有意', '思'], ['今', '天', '的', '课', '程', '内容', '很', '有意', '思'], ['今天', '的', '课程', '内', '容', '很', '有意', '思'], ['今', '天', '的', '课程', '内', '容', '很', '有意', '思'], ['今天', '的', '课', '程', '内', '容', '很', '有意', '思'], ['今', '天', '的', '课', '程', '内', '容', '很', '有意', '思'], ['今天', '的', '课程', '内容', '很', '有', '意', '思'], ['今', '天', '的', '课程', '内容', '很', '有', '意', '思'], ['今天', '的', '课', '程', '内容', '很', '有', '意', '思'], ['今', '天', '的', '课', '程', '内容', '很', '有', '意', '思'], ['今天', '的', '课程', '内', '容', '很', '有', '意', '思'], ['今', '天', '的', '课程', '内', '容', '很', '有', '意', '思'], ['今天', '的', '课', '程', '内', '容', '很', '有', '意', '思'], ['今', '天', '的', '课', '程', '内', '容', '很', '有', '意', '思']]\n",
      "[['经常', '有', '意见', '分歧'], ['经', '常', '有', '意见', '分歧'], ['经常', '有意', '见', '分歧'], ['经', '常', '有意', '见', '分歧'], ['经常', '有', '意', '见', '分歧'], ['经', '常', '有', '意', '见', '分歧']]\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(sentence_break(\"北京的天气真好啊\"))\n",
    "print(sentence_break(\"今天的课程内容很有意思\"))\n",
    "print(sentence_break(\"经常有意见分歧\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "## TODO 编写word_segment_naive函数来实现对输入字符串的分词\n",
    "import math\n",
    "\n",
    "def word_segment_naive(input_str):\n",
    "    \"\"\"\n",
    "    1. 对于输入字符串做分词，并返回所有可行的分词之后的结果。\n",
    "    2. 针对于每一个返回结果，计算句子的概率\n",
    "    3. 返回概率最高的最作为最后结果\n",
    "    \n",
    "    input_str: 输入字符串   输入格式：“今天天气好”\n",
    "    best_segment: 最好的分词结果  输出格式：[\"今天\"，\"天气\"，\"好\"]\n",
    "    \"\"\"\n",
    "\n",
    "    # TODO： 第一步： 计算所有可能的分词结果，要保证每个分完的词存在于词典里，这个结果有可能会非常多。 \n",
    "    segments = sentence_break(input_str)  # 存储所有分词的结果。如果次字符串不可能被完全切分，则返回空列表(list)\n",
    "                                          # 格式为：segments = [[\"今天\"，“天气”，“好”],[\"今天\"，“天“，”气”，“好”],[\"今“，”天\"，“天气”，“好”],...]\n",
    "    \n",
    "    # TODO: 第二步：循环所有的分词结果，并计算出概率最高的分词结果，并返回\n",
    "    best_segment = list()\n",
    "    best_score = 0\n",
    "    for seg in segments:\n",
    "        # TODO ...\n",
    "        if seg:\n",
    "            score = 0\n",
    "            for word in seg:\n",
    "                # 防止下溢，取log\n",
    "                score += math.log(word_prob[word])\n",
    "            if best_score == 0:\n",
    "                best_segment = seg\n",
    "                best_score = score\n",
    "            else:\n",
    "                if score > best_score:\n",
    "                    best_segment = seg\n",
    "                    best_score = score\n",
    "    \n",
    "    return best_segment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['北京', '的', '天气', '真', '好', '啊']\n",
      "['今天', '的', '课程', '内容', '很', '有意思']\n",
      "['经常', '有意', '见', '分歧']\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(word_segment_naive(\"北京的天气真好啊\"))\n",
    "print(word_segment_naive(\"今天的课程内容很有意思\"))\n",
    "print(word_segment_naive(\"经常有意见分歧\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基于维特比算法来优化上述流程\n",
    "\n",
    "原方法将分词和计算概率分开，这里使用状态转移的方法将二者同时进行。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "## TODO 编写word_segment_viterbi函数来实现对输入字符串的分词\n",
    "import math\n",
    "\n",
    "def word_segment_viterbi(input_str):\n",
    "    \"\"\"\n",
    "    1. 基于输入字符串，词典，以及给定的unigram概率来创建DAG(有向图）。\n",
    "    2. 编写维特比算法来寻找最优的PATH\n",
    "    3. 返回分词结果\n",
    "    \n",
    "    input_str: 输入字符串   输入格式：“今天天气好”\n",
    "    best_segment: 最好的分词结果  输出格式：[\"今天\"，\"天气\"，\"好\"]\n",
    "    \"\"\"\n",
    "    \n",
    "    # TODO: 第一步：根据词典，输入的句子，以及给定的unigram概率来创建带权重的有向图（Directed Graph）\n",
    "    #      有向图的每一条边是一个单词的概率（只要存在于词典里的都可以作为一个合法的单词），这些概率在 word_prob，如果不在word_prob里的单词但在\n",
    "    #      词典里存在的，统一用概率值1e-100。\n",
    "    # 图是为了直观起见，边表示字或词及其概率，节点存储状态，图有没有其实无所谓，从本质上讲其实就是个状态转移算法\n",
    "    # 每个节点的状态包含-log(P)和当前最优切分\n",
    "    memory = [[0, []] for _ in range(len(input_str)+1)]\n",
    "\n",
    "    # TODO： 第二步： 利用维特比算法来找出最好的PATH， 这个PATH是P(sentence)最大或者 -log P(sentence)最小的PATH。\n",
    "    # TODO: 第三步： 根据最好的PATH, 返回最好的切分\n",
    "    for i in range(1, len(input_str)+1):\n",
    "        for j in range(i):\n",
    "            # 这里偷个懒，默认没有形成词的单字可以在词典中找到（如果不成立事实上会返回完整句子，因为-log(1e-100)必然小于该值加某个非负数\n",
    "            word = input_str[j:i]\n",
    "            prob = word_prob[word] if word in word_prob else 1e-100\n",
    "\n",
    "            score = memory[j][0] - math.log(prob)\n",
    "            # 状态更新\n",
    "            if memory[i][0] == 0:\n",
    "                memory[i][0] = score\n",
    "                memory[i][1] = memory[j][1] + [word]\n",
    "            else:\n",
    "                if score < memory[i][0]:\n",
    "                    memory[i][0] = score\n",
    "                    memory[i][1] = memory[j][1] + [word]\n",
    "\n",
    "    return memory[-1][1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['北京', '的', '天气', '真', '好', '啊']\n",
      "['今天', '的', '课程', '内容', '很', '有意思']\n",
      "['经常', '有意', '见', '分歧']\n"
     ]
    }
   ],
   "source": [
    "# 测试\n",
    "print(word_segment_viterbi(\"北京的天气真好啊\"))\n",
    "print(word_segment_viterbi(\"今天的课程内容很有意思\"))\n",
    "print(word_segment_viterbi(\"经常有意见分歧\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
